{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Class02\n",
    "> *\"Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.\"* -- Wiki\n",
    "\n",
    "Data cleaning is an important prerequisite before your data analysis. Raw data, usually, contains inaccurate and irrevalent observations. Your data analysis is not reliable without proper data cleaning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### CRSP\n",
    "You should consider the following points before you use CRSP:\n",
    "* Data type\n",
    "* Stock exchanges\n",
    "* Share types\n",
    "* Duplicated observations\n",
    "* Negative price\n",
    "* Adjusted price\n",
    "* Industry"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "file_path = '/Users/ml/Google Drive/af/teaching/database/data/'\n",
    "msf_raw = pd.read_csv(file_path+'msf_2010_2017.txt',sep='\\t',low_memory=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PERMNO</th>\n",
       "      <th>date</th>\n",
       "      <th>SHRCD</th>\n",
       "      <th>EXCHCD</th>\n",
       "      <th>SICCD</th>\n",
       "      <th>NCUSIP</th>\n",
       "      <th>COMNAM</th>\n",
       "      <th>PERMCO</th>\n",
       "      <th>HSICCD</th>\n",
       "      <th>CUSIP</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10001</td>\n",
       "      <td>20100129</td>\n",
       "      <td>11.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>4925</td>\n",
       "      <td>29269V10</td>\n",
       "      <td>ENERGY INC</td>\n",
       "      <td>7953</td>\n",
       "      <td>4925</td>\n",
       "      <td>36720410</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>10001</td>\n",
       "      <td>20100226</td>\n",
       "      <td>11.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>4925</td>\n",
       "      <td>29269V10</td>\n",
       "      <td>ENERGY INC</td>\n",
       "      <td>7953</td>\n",
       "      <td>4925</td>\n",
       "      <td>36720410</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>10001</td>\n",
       "      <td>20100331</td>\n",
       "      <td>11.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>4925</td>\n",
       "      <td>29269V10</td>\n",
       "      <td>ENERGY INC</td>\n",
       "      <td>7953</td>\n",
       "      <td>4925</td>\n",
       "      <td>36720410</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>10001</td>\n",
       "      <td>20100430</td>\n",
       "      <td>11.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>4925</td>\n",
       "      <td>29269V10</td>\n",
       "      <td>ENERGY INC</td>\n",
       "      <td>7953</td>\n",
       "      <td>4925</td>\n",
       "      <td>36720410</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>10001</td>\n",
       "      <td>20100528</td>\n",
       "      <td>11.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>4925</td>\n",
       "      <td>29269V10</td>\n",
       "      <td>ENERGY INC</td>\n",
       "      <td>7953</td>\n",
       "      <td>4925</td>\n",
       "      <td>36720410</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   PERMNO      date  SHRCD  EXCHCD SICCD    NCUSIP      COMNAM  PERMCO HSICCD  \\\n",
       "0   10001  20100129   11.0     2.0  4925  29269V10  ENERGY INC    7953   4925   \n",
       "1   10001  20100226   11.0     2.0  4925  29269V10  ENERGY INC    7953   4925   \n",
       "2   10001  20100331   11.0     2.0  4925  29269V10  ENERGY INC    7953   4925   \n",
       "3   10001  20100430   11.0     2.0  4925  29269V10  ENERGY INC    7953   4925   \n",
       "4   10001  20100528   11.0     2.0  4925  29269V10  ENERGY INC    7953   4925   \n",
       "\n",
       "      CUSIP  \n",
       "0  36720410  \n",
       "1  36720410  \n",
       "2  36720410  \n",
       "3  36720410  \n",
       "4  36720410  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "msf_raw[msf_raw.columns[:10]].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "msf = msf_raw.copy()\n",
    "msf.columns = msf.columns.str.lower()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Data type\n",
    "It is important to identify the type of the data, especially for variable with mix type. For example, return or price data should be numeric format. However, the variable will be text format if the variable contains both numeric and string data. For string variable, you cannot perform any calculation.\n",
    "\n",
    "Return data in CRSP contains missing codes, such as 'A', 'B', 'C', etc. So the return is in string format after you import raw data from CRSP."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "permno      int64\n",
       "date        int64\n",
       "shrcd     float64\n",
       "exchcd    float64\n",
       "siccd      object\n",
       "ncusip     object\n",
       "comnam     object\n",
       "permco      int64\n",
       "hsiccd     object\n",
       "cusip      object\n",
       "prc       float64\n",
       "vol       float64\n",
       "ret        object\n",
       "shrout    float64\n",
       "cfacpr    float64\n",
       "retx       object\n",
       "dtype: object"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "msf.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **ret**, **retx**, **siccd**, **hsiccd** are string format. Actually, they should be numeric."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "B    4393\n",
       "C    4023\n",
       "Name: ret, dtype: int64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "msf[msf['ret'].str.extract('([A-Z])',expand=False).notnull()]['ret'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **ret** contains 'B' and 'C' which are not valid return."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Z    111\n",
       "Name: siccd, dtype: int64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "msf[msf['siccd'].str.extract('([A-Z])',expand=False).notnull()]['siccd'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **siccd** contains 'Z' which is not valid SIC code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "msf_1 = msf.copy()\n",
    "for i in ['ret','retx','siccd','hsiccd']:\n",
    "    msf_1[i] = pd.to_numeric(msf_1[i],errors='coerce')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "permno      int64\n",
       "date        int64\n",
       "shrcd     float64\n",
       "exchcd    float64\n",
       "siccd     float64\n",
       "ncusip     object\n",
       "comnam     object\n",
       "permco      int64\n",
       "hsiccd    float64\n",
       "cusip      object\n",
       "prc       float64\n",
       "vol       float64\n",
       "ret       float64\n",
       "shrout    float64\n",
       "cfacpr    float64\n",
       "retx      float64\n",
       "dtype: object"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "msf_1.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> After convertion, the four variables are numeric."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Stock exchanges\n",
    "Conventionall, we use three main stock exchanges in US market and **exchcd** allows us to find out them.\n",
    "\n",
    "| Stock exchange | exchcd |\n",
    "|----------------|--------|\n",
    "| NYSE           | 1      |\n",
    "| AMEX           | 2      |\n",
    "| NASDAQ         | 3      |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "msf_2 = msf_1[msf_1['exchcd'].isin([1,2,3])].copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Share type\n",
    "CRSP includes different type of securities, for example, common stocks and ETFs. Common stocks are most widely used and **shrcd** = 10 or 11  can helps us to filter out common shares."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "msf_3 = msf_2[msf_2['shrcd'].isin([10,11])].copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Duplicated observations\n",
    "CRSP contains duplicates, therefore, we have to remove them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "msf_4 = msf_3.drop_duplicates(['permno','date'],keep='last').copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Negative price\n",
    "CRSP uses average of bid and ask price to replace price if there is no valid closing price for a given data. To distinguish them, CRSP assigns negative sign (-) in front of the average of bid and ask price. It does not mean the price is negative. Therefore, we need to convert them to positive value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "msf_4['price'] = msf_4['prc'].abs()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Adjusted price\n",
    "CRSP returns (**ret**) already adjusted dividends and stock splits, and it also considers reinvestments effect. However, price in CRSP does not consider those corporate events. To adjust price, we can use the formula below:\n",
    "$$p_{adj} = \\frac{price}{cfacpr}$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "msf_4['p_adj'] = msf_4['price'] / msf_4['cfacpr']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Time periods"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "msf_p1 = msf_4[msf_4['date']<=20131231].copy()\n",
    "msf_p2 = msf_4[msf_4['date']>20131231].copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Industry\n",
    "For industry related research, we need to know the industry classification. **siccd** indicates the industry group."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "msf_5 = msf_4[(msf_4['siccd']<6000)|(msf_4['siccd']>6999)].copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> The above excludes financial firms. (SIC code for financial firms are between 6000 and 6999)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Summary statistics\n",
    "Descriptive statistics can help us to check the distribution of variables and any extreme values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count    293470.000000\n",
       "mean          0.011176\n",
       "std           0.162939\n",
       "min          -0.961852\n",
       "25%          -0.059090\n",
       "50%           0.005038\n",
       "75%           0.069364\n",
       "max          15.984456\n",
       "Name: ret, dtype: float64"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "msf_5['ret'].describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Convert 8-digit CUSIP to 6-digit CUSIP\n",
    "8-digit CUSIP: stock (issue) level\n",
    "\n",
    "6-digit CUSIP: firm level"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cusip</th>\n",
       "      <th>date</th>\n",
       "      <th>cusip6</th>\n",
       "      <th>ret</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>36720410</td>\n",
       "      <td>20100129</td>\n",
       "      <td>367204</td>\n",
       "      <td>-0.018932</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>36720410</td>\n",
       "      <td>20100226</td>\n",
       "      <td>367204</td>\n",
       "      <td>-0.000656</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>36720410</td>\n",
       "      <td>20100331</td>\n",
       "      <td>367204</td>\n",
       "      <td>0.020643</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>36720410</td>\n",
       "      <td>20100430</td>\n",
       "      <td>367204</td>\n",
       "      <td>0.124385</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>36720410</td>\n",
       "      <td>20100528</td>\n",
       "      <td>367204</td>\n",
       "      <td>0.004829</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      cusip      date  cusip6       ret\n",
       "0  36720410  20100129  367204 -0.018932\n",
       "1  36720410  20100226  367204 -0.000656\n",
       "2  36720410  20100331  367204  0.020643\n",
       "3  36720410  20100430  367204  0.124385\n",
       "4  36720410  20100528  367204  0.004829"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "msf_5['cusip6'] = msf_5['cusip'].str[:6]\n",
    "msf_5[['cusip','date','cusip6','ret']].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Annual return (December to December)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>permno</th>\n",
       "      <th>date</th>\n",
       "      <th>p_adj</th>\n",
       "      <th>yrm</th>\n",
       "      <th>year</th>\n",
       "      <th>p_adj_l1</th>\n",
       "      <th>ret</th>\n",
       "      <th>year_l1</th>\n",
       "      <th>yr_diff</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10001</td>\n",
       "      <td>20101231</td>\n",
       "      <td>10.52</td>\n",
       "      <td>201012</td>\n",
       "      <td>2010</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>10001</td>\n",
       "      <td>20111230</td>\n",
       "      <td>11.42</td>\n",
       "      <td>201112</td>\n",
       "      <td>2011</td>\n",
       "      <td>10.52</td>\n",
       "      <td>0.085551</td>\n",
       "      <td>2010.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>10001</td>\n",
       "      <td>20121231</td>\n",
       "      <td>9.33</td>\n",
       "      <td>201212</td>\n",
       "      <td>2012</td>\n",
       "      <td>11.42</td>\n",
       "      <td>-0.183012</td>\n",
       "      <td>2011.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>10001</td>\n",
       "      <td>20131231</td>\n",
       "      <td>8.03</td>\n",
       "      <td>201312</td>\n",
       "      <td>2013</td>\n",
       "      <td>9.33</td>\n",
       "      <td>-0.139335</td>\n",
       "      <td>2012.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>10001</td>\n",
       "      <td>20141231</td>\n",
       "      <td>11.02</td>\n",
       "      <td>201412</td>\n",
       "      <td>2014</td>\n",
       "      <td>8.03</td>\n",
       "      <td>0.372354</td>\n",
       "      <td>2013.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   permno      date  p_adj     yrm  year  p_adj_l1       ret  year_l1  yr_diff\n",
       "0   10001  20101231  10.52  201012  2010       NaN       NaN      NaN      NaN\n",
       "1   10001  20111230  11.42  201112  2011     10.52  0.085551   2010.0      1.0\n",
       "2   10001  20121231   9.33  201212  2012     11.42 -0.183012   2011.0      1.0\n",
       "3   10001  20131231   8.03  201312  2013      9.33 -0.139335   2012.0      1.0\n",
       "4   10001  20141231  11.02  201412  2014      8.03  0.372354   2013.0      1.0"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "annual = msf_5[['permno','date','p_adj']].copy()\n",
    "annual['yrm'] = (annual['date']/100).astype(int)\n",
    "annual['year'] = (annual['yrm']/100).astype(int)\n",
    "annual = annual[annual['yrm']%100==12]\n",
    "annual = annual.sort_values(['permno','year']).reset_index(drop=True)\n",
    "annual['p_adj_l1'] = annual.groupby('permno')['p_adj'].shift(1)\n",
    "annual['ret'] = annual['p_adj'] / annual['p_adj_l1'] - 1\n",
    "annual['year_l1'] = annual.groupby('permno')['year'].shift(1)\n",
    "annual['yr_diff'] = annual['year'] - annual['year_l1']\n",
    "annual['ret'] = np.where(annual['yr_diff']!=1,np.nan,annual['ret']) \n",
    "annual.head()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
